ENVX1002 Introduction to Statistical Methods
The University of Sydney
Feb 2025
Installation Support
If you have any trouble with installation:
The R graphical user interface. Source
RStudio IDE. Source: Januar Harianto
Statistical programming combines statistics and computer code to:
It’s like having a powerful calculator that can help us tell stories about our data in a repeatable way.
Population
Sample
Most (if not all) statistical analyses are based on samples, not populations.
How well does a sample represent the population?
Some thoughts:
Different samples give different results – suppose we have a population of 1000 trees and we randomly sample 6 tree heights. If this is done 3 times, it is likely that the samples will be different:
Sample 1: 21.66633 22.61768 22.79266 17.64633 14.50462 17.9679
Sample 2: 15.60759 14.0909 17.89364 18.10461 20.48023 22.88689
Sample 3: 17.75913 15.89302 26.43149 27.08996 14.99993 34.06894
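A minimal sketch of this idea in R. The population here is simulated, and its mean, spread, and the seed are assumptions made only for illustration, so the values will not match the samples shown above:

```r
set.seed(1)  # arbitrary seed, used only so the sketch is repeatable

# Hypothetical population: 1000 tree heights (m), assumed roughly normal
tree_heights <- rnorm(1000, mean = 20, sd = 5)

# Draw three independent random samples of 6 trees each
samples <- replicate(3, sample(tree_heights, size = 6), simplify = FALSE)

# Each sample has its own mean, and none need equal the population mean
sapply(samples, mean)
mean(tree_heights)
```

Re-running the sketch with a different seed gives three different samples again, which is the point: every sample tells a slightly different story about the same population.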
So how do we make sense of these samples?
We can describe our samples using:
mean | median | mode
The mean is what most people call the “average”:
Mathematical notation
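In standard notation, the population mean and the sample mean are usually written as below. The symbols \mu (population mean) and \bar{x} (sample mean) are the conventional choices and are assumed here:

```latex
\mu = \frac{1}{N}\sum_{i=1}^{N} x_i
\qquad
\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i
```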
Where x_i is each individual value, N is population size, and n is sample size.
We can save a group of numbers in a vector called scores in R:
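For example, with four made-up scores (the values are hypothetical and chosen only for illustration):

```r
# Hypothetical scores, not taken from the course data
scores <- c(80, 85, 90, 95)

# Print the vector to check its contents
scores
```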
Manual calculations:
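A sketch of the calculation by hand, compared with the built-in mean(). The scores are hypothetical values used only for illustration:

```r
scores <- c(80, 85, 90, 95)  # hypothetical values

total <- sum(scores)         # add up all the values
n <- length(scores)          # count how many values there are
manual_mean <- total / n     # mean = sum / count

manual_mean
mean(scores)                 # built-in function gives the same result
```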
Excel offers several ways to calculate the mean:
Using AVERAGE function
=AVERAGE(A1:A4)
Using AutoCalculate
The median is the middle number when your data is in order:
Example: House prices ($’000s): 450, 1100, 480, 460, 470, 420, 1400, 450, 470
Order: 420, 450, 450, 460, 470, 470, 480, 1100, 1400
How is it useful?
R does all the ordering and finding the middle for us:
# House prices
prices <- c(450, 1100, 480, 460, 470, 420, 1400, 450, 470)
# Find median
median(prices)
[1] 470
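To see what median() is doing, the same result can be sketched by hand for an odd number of values: sort them and pick the middle one. The variable names below are illustrative:

```r
prices <- c(450, 1100, 480, 460, 470, 420, 1400, 450, 470)

sorted_prices <- sort(prices)                # put the prices in order
n <- length(sorted_prices)                   # 9 values, so the middle is position 5
manual_median <- sorted_prices[(n + 1) / 2]  # valid only when n is odd

manual_median
mean(prices)  # for comparison with the median
```

Here the mean comes out well above the median because the two most expensive houses pull it upward, while the median stays at a typical price.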
Which is a better measure for house prices?
Excel provides two main ways to find the median:
Using MEDIAN function
=MEDIAN(A1:A9)
Alternative method
The mode is the value that appears most frequently in your data. It’s particularly useful for:
Calculating the mode can be tricky, especially if there are multiple modes or no mode at all. This is why the mode is not commonly used in statistics.
R has no built-in function for the statistical mode (the base mode() function returns an object's storage type instead), so we use the modeest package:
Loading required package: modeest
[1] 5
If you were to do it yourself, how would you do it in R?
Use the table() function to count frequencies:
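One possible sketch of the table() approach. The data vector is hypothetical, and the code returns every value tied for the highest count, so it also handles multiple modes:

```r
values <- c(2, 5, 5, 3, 5, 7, 2)  # hypothetical data

freq <- table(values)             # count how often each value occurs
# Keep the value(s) whose count equals the highest count
modes_table <- as.numeric(names(freq)[freq == max(freq)])

modes_table
```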
Use run-length encoding after sorting:
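A sketch of the run-length approach with base R's rle(), again on hypothetical data. After sorting, equal values sit next to each other, so the longest run marks the mode:

```r
values <- c(2, 5, 5, 3, 5, 7, 2)  # hypothetical data

runs <- rle(sort(values))         # lengths of runs of identical values
# Keep the value(s) with the longest run
modes_rle <- runs$values[runs$lengths == max(runs$lengths)]

modes_rle
```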
The point is that it doesn’t matter how you calculate the mode, as long as you are able to do it. Also – if you needed this – aren’t you glad R has a package for it?
Excel provides several methods to find the mode:
=MODE.SNGL(A1:A10)
=MODE.MULT(A1:A10)
Source: Adobe Stock # 85659279
Imagine sampling seagrass blade lengths from two different sites in a marine ecosystem, and both have a mean length of about 15.2 cm. Are both sites the same?
library(ggplot2)    # plotting
library(patchwork)  # combining plots with +
seagrass_protected <- c(15.2, 15.0, 15.3, 15.1, 15.2)
seagrass_exposed <- c(12.0, 18.0, 14.5, 16.5, 15.0)
# Create plots for both sites
p1 <- ggplot() +
geom_point(aes(x = 1:5, y = seagrass_protected), size = 3) +
geom_hline(yintercept = mean(seagrass_protected), linetype = "dashed", color = "red") +
labs(title = "Site A: Protected Bay", x = "Measurement", y = "Length (cm)") +
ylim(10, 20)
p2 <- ggplot() +
geom_point(aes(x = 1:5, y = seagrass_exposed), size = 3) +
geom_hline(yintercept = mean(seagrass_exposed), linetype = "dashed", color = "red") +
labs(title = "Site B: Wave-exposed Coast", x = "Measurement", y = "Length (cm)") +
ylim(10, 20)
# Combine plots side by side
p1 + p2
# Create our seagrass data
seagrass_protected <- c(15.2, 15.0, 15.3, 15.1, 15.2) # Protected bay
seagrass_exposed <- c(12.0, 18.0, 14.5, 16.5, 15.0) # Wave-exposed coast
# Calculate ranges
cat("Protected bay range:", diff(range(seagrass_protected)), "cm\n")
cat("Wave-exposed range:", diff(range(seagrass_exposed)), "cm\n")
Protected bay range: 0.3 cm
Wave-exposed range: 6 cm
Note
The range shows us that seagrass lengths are much more variable in the wave-exposed site!
The IQR tells us how spread out the middle 50% of our data is:
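A sketch of how this can be computed with base R's quantile() and IQR(), reusing the seagrass vectors. The exact calls are an assumption, chosen to be consistent with the printed output that follows:

```r
seagrass_protected <- c(15.2, 15.0, 15.3, 15.1, 15.2)  # Protected bay
seagrass_exposed <- c(12.0, 18.0, 14.5, 16.5, 15.0)    # Wave-exposed coast

# Quartiles of the protected site (0%, 25%, 50%, 75%, 100%)
quantile(seagrass_protected)

# IQR = 75th percentile minus 25th percentile
IQR(seagrass_protected)
IQR(seagrass_exposed)
```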
0% 25% 50% 75% 100%
15.0 15.1 15.2 15.2 15.3
[1] 0.1
[1] 2
Note
The larger IQR in the wave-exposed site shows more spread in the typical seagrass lengths.
Variance measures how far data points are spread from their mean by:
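In outline: take each value's deviation from the mean, square it, sum the squares, and divide by n - 1 (for a sample). A minimal sketch of those steps in R, checked against the built-in var(); the intermediate variable names are illustrative:

```r
seagrass_exposed <- c(12.0, 18.0, 14.5, 16.5, 15.0)  # Wave-exposed coast

# Step 1: distance of each value from the mean
deviations <- seagrass_exposed - mean(seagrass_exposed)
# Step 2: square the distances so negatives do not cancel positives
squared_dev <- deviations^2
# Step 3: sum and divide by n - 1 (sample variance)
manual_var <- sum(squared_dev) / (length(seagrass_exposed) - 1)

manual_var
var(seagrass_exposed)  # built-in equivalent
```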
Protected bay variance: 0.013 cm²
Wave-exposed variance: 5.075 cm²
Note
The larger variance in the wave-exposed site shows more spread from the mean!
Standard deviation (SD, or \sigma for population, s for sample) is the square root of variance:
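Written out in standard notation, with \mu the population mean, \bar{x} the sample mean, N the population size, and n the sample size (the conventional definitions, assumed here):

```latex
\sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2}
\qquad
s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2}
```

Because of the square root, the standard deviation is in the same units as the data (cm here), unlike the variance (cm²).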
We can describe our seagrass lengths using mean ± standard deviation:
# Protected bay
mean_p <- mean(seagrass_protected)
sd_p <- sd(seagrass_protected)
cat("Protected bay:", round(mean_p, 1), "±", round(sd_p, 2), "cm\n")
Protected bay: 15.2 ± 0.11 cm
# Wave-exposed
mean_e <- mean(seagrass_exposed)
sd_e <- sd(seagrass_exposed)
cat("Wave-exposed:", round(mean_e, 1), "±", round(sd_e, 2), "cm\n")
Wave-exposed: 15.2 ± 2.25 cm
Tip
The ± tells us about the typical variation around the mean. Larger values indicate more spread!
| Measure | Protected Bay | Wave-exposed Coast | What it Tells Us |
|---|---|---|---|
| Range | 0.3 cm | 6 cm | Overall spread (sensitive to outliers) |
| IQR | 0.1 cm | 2 cm | Middle 50% spread (ignores extremes) |
| Variance | 0.013 cm² | 5.075 cm² | Average squared distance from mean |
| SD | 0.11 cm | 2.25 cm | Typical distance from mean (in original units) |
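The four measures in the table can be gathered in one step. This is only a sketch: spread_summary() is a hypothetical helper written for this example, not a function from the slides or from any package:

```r
seagrass_protected <- c(15.2, 15.0, 15.3, 15.1, 15.2)  # Protected bay
seagrass_exposed <- c(12.0, 18.0, 14.5, 16.5, 15.0)    # Wave-exposed coast

# Hypothetical helper: returns the four spread measures for one vector
spread_summary <- function(x) {
  c(range = diff(range(x)), iqr = IQR(x), variance = var(x), sd = sd(x))
}

# One column per site, one row per measure, rounded to 3 decimals
spread_table <- round(
  sapply(list(protected = seagrass_protected, exposed = seagrass_exposed),
         spread_summary),
  3
)
spread_table
```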
Common Excel functions for measuring spread:
MAX() and MIN()
=MAX(A1:A10) - MIN(A1:A10)
QUARTILE.INC()
For Q1: =QUARTILE.INC(A1:A10, 1)
For Q3: =QUARTILE.INC(A1:A10, 3)
For IQR: =QUARTILE.INC(A1:A10, 3) - QUARTILE.INC(A1:A10, 1)
Statistical functions for variance and standard deviation:
VAR.S()
=VAR.S(A1:A10)
STDEV.S()
=STDEV.S(A1:A10)
Tip
Use .P instead of .S for population measures:
- VAR.P() for population variance
- STDEV.P() for population standard deviation
This presentation is based on the SOLES Quarto reveal.js template and is licensed under a Creative Commons Attribution 4.0 International License.